The purpose of this study was to develop voice authentication
techniques which can be used over packet switched networks
without the limitations of a single fixed reference phrase.

The  results  of  this  study  are  presented  in  the  two  manuscripts 
which  have  been  or  will  be  published  (in  February  1979)  in  the 
IEEE  Acoustics,  Speech  and  Signal  Processing  Journal.* 


The  results  of  this  study  were  significant  in  several  respects. 
First  of  all,  the  largest  data  base  of  controlled  conditions 
but  extemporaneous  human  speech  in  existence  was  developed  for 
the  project.  Secondly,  a  real-time  processing  capability  was 
developed  for  processing  this  data  base  of  40  hours  of  continuous 
speech. Finally, with the restriction of clean speech (not
processed over a telephone or with noise), text-independent
speaker recognition results nearly matching those for
text-dependent studies were achieved by averaging over 30-40 second
speech segments.

Long-Term Feature Averaging for
Speaker Recognition

JOHN D. MARKEL, BEATRICE T. OSHIKA, AND AUGUSTINE H. GRAY, JR., MEMBER, IEEE


Abstract: The potential benefits of long-term parameter averaging
for speaker recognition were investigated. Parameters studied were
pitch, gain, and reflection coefficients. Parameter variability was
computed over various averaging lengths from one frame averaging (in
effect, no averaging) to 1000 frame averaging (about 70 s of speech). It
was demonstrated that the between-to-within speaker variance ratio,
measured over several speakers, was significantly increased by
performing long-term averaging of the parameter sets. The reflection
coefficient averages for $k_2$ and $k_6$, respectively, were shown to
produce the highest variance ratios.


I. INTRODUCTION

THERE have been several studies on the choice of acoustic
features in speaker recognition tasks [14], [19], [22].
Average fundamental frequency has been found to be a useful
discriminating feature [13], as have gain measurements [2],
[10] and long-term speech spectra [4]-[6], [9]. Perceptual
studies indicate that "there is at least weak evidence that a
voice that is distinctive to listen to also has distinctive
spectrographic patterns" [20], and that dimensions of "characteristic
pitch" and "characteristic loudness" may be posited to
differentiate among speakers [21]. These speaker characteristics
can be distinguished from the acoustic cues which signal
linguistic elements, e.g., phonemes or words. For example, the
realization of the word "bit" by a female child is acoustically
very different from the same word pronounced by an adult
male, yet the words are generally understood to be equivalent
while the speakers are clearly different. It appears, then, that
listeners adapt to speakers' voice characteristics (as well as
their linguistic characteristics).

All this suggests that there are long-term characteristics
which can be used in text-independent speaker recognition
tasks. Such characteristics include long-term averages related
to fundamental frequency, gain, and spectral averages.

The motivation for long-term averaging in text-independent
speaker recognition is based upon a result from statistical
sampling theory.

We assume that $\{p(i)\}$ defines statistically independent,
identically distributed samples of the parameter $p$ with true
mean $\mu_p$ and variance $\sigma_p^2$. (For example, $\{k_1(i)\}$
corresponds to the reflection coefficient samples for each analysis
frame.) If $x = \langle p(i) \rangle$ defines a feature based upon
long-term averaging of $p$, where

$$\langle p(i) \rangle = \frac{1}{L_v} \sum_{i=0}^{L_v - 1} p(i) \qquad (1)$$

and $L_v$ is the number of voiced analysis frames used in the
averaging, then the variance of $x$ is given in terms of the original
parameter variance $\sigma_p^2$ by

$$\sigma_x^2 = \frac{\sigma_p^2}{L_v}. \qquad (2)$$

The sample variance as a function of $L_v$ is an important figure
of merit for a particular feature. For example, if the features
are more tightly clustered together about the sample mean as
$L_v$ increases from $L_v = 1$ (no averaging), then the intraspeaker
variability is decreased, and the parameters would be expected
to result in higher performance in text-independent speaker
recognition tasks. Although no "true mean" or "true variance"
exists for real speech because of physiological variations in
human speech, it is reasonable to assume that at least some
convergence or clustering of parameters will occur with
long-term averaging.
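The sampling-theory expectation can be checked numerically. The following is a minimal sketch, assuming synthetic independent Gaussian samples in place of real speech parameters; the function name `averaged_feature_variance` and the value of the per-frame standard deviation are illustrative choices, not taken from the study. Under the independence assumption, the variance of an $L_v$-frame average should be close to the per-frame variance divided by $L_v$.

```python
import random
import statistics


def averaged_feature_variance(samples, l_v):
    """Sample variance of non-overlapping l_v-frame averages of `samples`."""
    n_blocks = len(samples) // l_v
    means = [
        statistics.fmean(samples[j * l_v:(j + 1) * l_v])
        for j in range(n_blocks)
    ]
    # statistics.variance is the unbiased estimate (divides by n_blocks - 1).
    return statistics.variance(means)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical i.i.d. parameter track (sigma = 0.2); not real speech data.
    p = [rng.gauss(0.0, 0.2) for _ in range(100_000)]
    sigma_p2 = statistics.variance(p)
    for l_v in (1, 10, 100):
        # Measured variance next to the i.i.d. prediction sigma_p^2 / L_v.
        print(l_v, averaged_feature_variance(p, l_v), sigma_p2 / l_v)
```

Real speech parameters are correlated from frame to frame, so measured variances would be expected to fall more slowly than this idealized prediction.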

The purpose of this paper is to define several sets of
potentially useful long-term features and then to investigate their
statistical properties as a function of the averaging length $L_v$.

In addition, discrimination tests are presented over a small
homogeneous set of speakers to illustrate the potential benefits
of long-term averaging for unconstrained text-independent
speaker recognition.

II. FEATURES

To discuss the applicability of long-term feature averaging in
a quantitative manner, we have chosen three different feature
sets as the basis for analysis. Some of these features reflect
physiological characteristics more closely than others.

A. Fundamental Frequency Features

Due to physiological considerations such as the length and
thickness of the vocal folds, and respiratory muscle patterns,
the phonation of a particular vowel with "normal effort" may
result in differing rates of vocal fold vibration (corresponding
to the acoustical correlate of fundamental frequency) for
different speakers. For example, a child will have a high
fundamental frequency compared to an adult because of the child's
smaller vocal folds.

Although fundamental frequency, along with intensity and
duration, is a controllable attribute of stress and intonation
which may vary widely, each person appears to have a mean
fundamental frequency value which, if averaged over a
sufficiently long period of time, is relatively constant over a
reasonable time span and is independent of linguistic content [8].

In addition, the standard deviation of the fundamental
frequency over a long interval of time may carry important
speaker-dependent information. For example, if the speaker
is judged to be a monotone speaker, then the standard
deviation would be expected to be relatively small. However, if the
speaker is thought to be an "expressive" or "forceful" speaker,
it would be expected to be relatively large.

B.  A  Gain  Feature 

It seems reasonable to assume that one of the characteristics
that contributes to a speaker's identity is the amount of
intensity or gain variation in his speech over time. Subjectively, the
amount of gain variation is possibly correlated with the
perception of "dynamic" versus "flat" voices. The actual gain
variation is also a function of phonetic content, word and
phrase stress, and discourse context. For example, for a
constant subglottal pressure, the acoustical output energy for an
/a/ is about 5 dB greater than for a /u/. Also, a larger gain
variation would be expected with an exclamatory as opposed
to a normal declarative sentence. Our assumption is that, over
a sufficiently long interval of speech, gain variation can be
considered part of the individual speaker's characteristics.
That is, a speaker who is judged overall to be an "emphatic"
speaker will have larger gain variation than one who is judged
to have a usually monotonous voice.

In the measurement of gain variation, it is very important
that results be only a function of speaker characteristics and
not absolute system gain. Furthermore, because of the
distinctly different production mechanism between voiced and
unvoiced speech, it is desirable to measure the gain variations
only during voiced speech. A normalized gain variation which
satisfies desired physical properties is now defined. If $R(n)$
defines the energy of $N$ speech samples $s_n(i)$ in frame $n$, then

$$R(n) = \sum_{i=0}^{N-1} s_n^2(i). \qquad (3)$$

The sample mean and sample variance of $R(n)$ over $L_v$ voiced
frames is then defined by

$$\bar{R} = \langle R(n) \rangle \qquad (4)$$

and

$$\sigma_R^2 = \langle [R(n) - \bar{R}]^2 \rangle \qquad (5)$$

where $\langle \cdot \rangle$ will be used throughout to denote averaging
over $L_v$ voiced frames. The normalized gain variation $\delta$ is then
defined by

$$\delta = \sigma_R / \bar{R}. \qquad (6)$$

If the overall system gain is changed by a constant value, $\delta$ is
unaffected. Furthermore, $\delta$ is nonnegative with $\delta = 0$ only
when $\sigma_R = 0$. Physically, $\delta = 0$ means that the speech
envelope (more precisely the frame energy) is unchanged over the
complete range of voiced speech analyzed.
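The gain-invariance property can be illustrated with a short sketch. It assumes that the normalized gain variation is the ratio of the standard deviation to the mean of the per-frame energy over the voiced frames, as defined above; the function names and the toy frames are hypothetical.

```python
import math


def frame_energy(frame):
    """Energy of one analysis frame: the sum of squared samples."""
    return sum(s * s for s in frame)


def normalized_gain_variation(voiced_frames):
    """Standard deviation of frame energy divided by mean frame energy."""
    energies = [frame_energy(f) for f in voiced_frames]
    r_bar = sum(energies) / len(energies)
    var_r = sum((r - r_bar) ** 2 for r in energies) / len(energies)
    return math.sqrt(var_r) / r_bar


if __name__ == "__main__":
    # Toy frames (illustrative values only, not speech data).
    frames = [[1.0, -1.0], [2.0, 0.0], [0.5, 0.5]]
    louder = [[3.0 * s for s in f] for f in frames]  # system gain changed by 3x
    print(normalized_gain_variation(frames))
    print(normalized_gain_variation(louder))  # same value up to rounding
```

Scaling every sample by a constant multiplies both the mean and the standard deviation of the frame energy by the same factor, so the ratio is unchanged.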

C. Spectral Features

It is well established in the literature that one of the
acoustical features that tends to differentiate one particular speaker
from another during voiced speech production is the glottal
sound source shape [15].

Although the spectral slope of a single glottal pulse can vary
over a wide range from nearly whispered speech to very intense
vocal effort, for normal conversational speech it is expected
that an average glottal source spectrum could be obtained over
a relatively long interval of speech that would have relatively
small intraspeaker variability.

Unfortunately, glottal volume velocity waveform estimation
from speech is a nontrivial task [7], [12], [16]. A more
direct method for automatic real-time analysis is to use a
parameter set that is related to the smooth characteristics of the
spectrum, which is independent of fundamental frequency or
gain. With linear prediction analysis, obvious possibilities are
filter coefficients, reflection coefficients, or log area functions.
Sambur [17] compared these coefficients in a speech
recognition experiment and decided to make use of the reflection
coefficients. Although reflection coefficients are nonlinearly
related to the more physically meaningful smooth-spectral and
log spectral model from linear prediction analysis, there is
ample evidence that they do contain important
speaker-dependent information that is not contained in fundamental
frequency- or gain-related parameters. For example, in the
case of a first-order filter, $M = 1$, a smooth spectral model can
be physically and mathematically related to the first reflection
coefficient $k_1$. This model [11, p. 139] has a spectral flatness
given by

$$\Xi(1/A) = 1 - k_1^2. \qquad (7)$$

If the speech sample being analyzed has a nearly flat spectral
trend, $k_1$ approaches zero and the spectral flatness approaches
unity. As the spectral slope increases negatively, $k_1$ approaches
$-1$ and the spectral flatness approaches zero.
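The first-order relation between the reflection coefficient and spectral flatness can be verified numerically. The sketch below assumes that spectral flatness is the ratio of the geometric mean to the arithmetic mean of the model power spectrum, and that the first-order inverse filter is $A(z) = 1 + k_1 z^{-1}$ (the sign convention is an assumption; the result depends only on $k_1^2$). The function name is hypothetical.

```python
import cmath
import math


def first_order_flatness(k1, n_points=4096):
    """Numerical spectral flatness of the model spectrum 1/|A|^2.

    A(z) = 1 + k1 * z^-1 is sampled at n_points equally spaced
    frequencies around the unit circle. Flatness is the geometric mean
    of the power spectrum divided by its arithmetic mean.
    """
    log_sum = 0.0
    lin_sum = 0.0
    for m in range(n_points):
        theta = 2.0 * math.pi * m / n_points
        a = 1.0 + k1 * cmath.exp(-1j * theta)
        power = 1.0 / abs(a) ** 2
        log_sum += math.log(power)
        lin_sum += power
    return math.exp(log_sum / n_points) / (lin_sum / n_points)


if __name__ == "__main__":
    for k1 in (0.0, -0.5, -0.9):
        # Numerical flatness next to the closed form 1 - k1^2.
        print(k1, first_order_flatness(k1), 1.0 - k1 * k1)
```

The numerical and closed-form values agree closely for $|k_1| < 1$, which is the stable range of the first-order model.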

Based upon the spectral matching properties of linear
prediction [11, p. 134], we would assume that preemphasis of the
data would be beneficial since the reflection coefficients
would then carry more information about the spectral
structure at higher frequencies.

It would also seem reasonable that if long-term, spectrally
related features are desired which minimize intravariability,
only voiced speech should be analyzed. Substantial differences
exist in the physiological mechanisms which produce voiced
and unvoiced sounds. Since the excitation for unvoiced speech
is generally assumed to have a flat spectrum, the difference in
spectral slope between voiced and unvoiced sounds may be on
the order of 8-16 dB. With only voiced sounds, some variation
will still occur since different articulator positions will cause
variations on the acoustic loading at the glottis, affecting the
glottal source shape. This variation, however, is expected to
be substantially less than that due to glottal source variations
in voiced-unvoiced speech production.

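D. Summary of Feature Definitions

As features we study the following.

1) $F_0$ average

$$x_1 = \bar{F}_0 = \langle F_0(n) \rangle. \qquad (8)$$

2) Standard deviation of $F_0$

$$x_2 = \langle [F_0(n) - \bar{F}_0]^2 \rangle^{1/2}. \qquad (9)$$

3) Sample gain variation

$$x_3 = \sigma_R / \bar{R} \qquad (10)$$

where

$$\bar{R} = \langle R(n) \rangle \qquad (11)$$

and

$$\sigma_R^2 = \langle [R(n) - \bar{R}]^2 \rangle. \qquad (12)$$

4) Spectrally related features (reflection coefficient averages)

$$x_{i+3} = \langle k_i(n) \rangle \quad \text{for } i = 1, 2, \ldots, M. \qquad (13)$$

The feature vector $x$ is defined by

$$x = [x_1 \; x_2 \; \cdots \; x_{M+3}]^T \qquad (14)$$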

where $T$ denotes transpose.

III. PROCEDURES

A. Data

The data used for the analysis were obtained during
interviews of four speakers. Each interview was then edited so that
only the interviewees' voices remained. The total duration of
each edited interview (including pauses) was typically
15-18 min. The total data base used for this study was
approximately one hour in duration. No special precautions in
recording conditions were imposed on the experiment.
Interviews were conducted in normal room environments with a
dynamic microphone and an audio tape recorder. So that a
small number of speakers could be used with some generality
in extrapolating results, a homogeneous population of four
male speakers was chosen, each having somewhat similar
speech characteristics and relatively narrow fundamental
frequency ranges. Histograms of the raw nonaveraged
fundamental frequency values showed substantial overlap among the
four speakers.

B. Digital Processing of Data

The audio tape was digitally processed using the system
shown in Fig. 1. Each test segment was recorded onto a disk
using conventional procedures. A novel part of the procedure
is based upon the use of a high-speed signal processor and
oscilloscope (for visual feedback during processing). Using an
array-processing software system, it is possible to process the
data in real time at a 50 Hz analysis frame rate from a Fortran
environment. Processing includes modified cepstral pitch
period and voicing detection, gain calculation, linear prediction
analysis for reflection coefficients, and a running mean and
mean-square computation of these parameters.

The procedure for generating output feature vectors to be
used in the statistical analysis is shown in Fig. 2. A counter
for frame $n$ is incremented and one frame of speech is
analyzed. The parameters used are sampling frequency =
6.5 kHz, number of analysis coefficients $M = 10$, number of
samples for reflection coefficient computation = 128, and the
number of samples for $F_0$ and gain parameter analysis = 256
(40 ms). The analysis frame rate is 50 Hz. Preemphasis of the
speech data is applied using a differencer, $1 - z^{-1}$.
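A first-difference preemphasis filter of this kind is simple to state in code. The following is a minimal sketch, assuming a zero initial state before the first sample; the function name is hypothetical and nothing here is taken from the original Fortran system.

```python
def preemphasize(samples):
    """Apply a first-order differencer: y[n] = x[n] - x[n-1]."""
    previous = 0.0  # assumed zero initial state before the first sample
    out = []
    for x in samples:
        out.append(x - previous)
        previous = x
    return out


if __name__ == "__main__":
    # A constant (zero-frequency) input is removed after the first sample,
    # while a sample-to-sample alternation (highest frequency) is doubled.
    print(preemphasize([1.0, 1.0, 1.0, 1.0]))
    print(preemphasize([1.0, -1.0, 1.0, -1.0]))
```

The differencer attenuates low frequencies and boosts high frequencies, which is why it shifts the information carried by the reflection coefficients toward the upper part of the spectrum.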

Fundamental frequency estimation is performed with a
modified cepstral technique. After the spectrum has been
computed, a symmetrical window function is applied that smoothly
tapers from unity at 1000 Hz to zero from 1500 Hz upward.
This simple modification resolves most of the voicing problems
one obtains with the usual cepstral analysis method since only
the most consistent region of harmonic structure is used [3].
Two frames of delay are included in the system so that some
amount of error detection and correction can be applied in the
pitch period estimation. One additional test has been found
necessary for obtaining meaningful feature vectors. A
max($F_0$) and min($F_0$) value are chosen for the speaker being
analyzed to prevent gross errors from dramatically affecting the
fundamental frequency features. If
min($F_0$) < $F_0$ < max($F_0$), the frame is judged to be voiced
and the long-term averages are updated. The frame counter $l$ is
incremented and if $l > L_v$, the resultant feature vector $x$ is
output to disk, $l$ is reset to zero, and analysis then continues.
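The accumulation loop described in this paragraph can be sketched as follows. This is an illustration restricted to the two fundamental frequency features, using running sums for the mean and mean square. It assumes one vector is emitted as soon as exactly `l_v` voiced frames have been accumulated, which is a simplification of the counter test in the text; the function and argument names are hypothetical.

```python
import math


def long_term_f0_features(f0_track, f0_min, f0_max, l_v):
    """Yield (mean F0, std F0) each time l_v voiced frames are accumulated.

    A frame counts as voiced only if f0_min < f0 < f0_max; all other
    frames (unvoiced, or gross pitch errors) are skipped.
    """
    count = 0
    total = 0.0
    total_sq = 0.0
    for f0 in f0_track:
        if not (f0_min < f0 < f0_max):
            continue  # do not update the long-term averages
        count += 1
        total += f0
        total_sq += f0 * f0
        if count >= l_v:
            mean = total / count
            # Variance from running mean and mean square; clamp rounding error.
            var = max(total_sq / count - mean * mean, 0.0)
            yield mean, math.sqrt(var)
            count, total, total_sq = 0, 0.0, 0.0


if __name__ == "__main__":
    # Toy pitch track in Hz; 0 marks unvoiced frames, 900 a gross error.
    track = [0, 100, 120, 900, 110, 130, 0, 90]
    print(list(long_term_f0_features(track, 50, 400, 2)))
```

Frames left over at the end of the track (fewer than `l_v` voiced frames) are discarded in this sketch.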

IV.  Experiments 

A. Experiment 1: Statistical Variation as a Function of $L_v$

The complete edited audio tape for speaker D
(approximately 18 min in duration) was analyzed to extract long-term
averaged feature vectors for several $L_v$ conditions. As a time
reference, $L_v = 1000$ corresponds to approximately 70 s. The
total number of vector samples obtained is approximately
inversely proportional to $L_v$.

The unbiased variance estimate of the feature $x = \langle p(i) \rangle$
based upon the speech parameter $p$ is

$$\sigma^2(x) = \frac{1}{L_f - 1} \sum_{j=1}^{L_f} (x_j - \bar{x})^2 \qquad (15)$$

where

$$\bar{x} = \frac{1}{L_f} \sum_{j=1}^{L_f} x_j. \qquad (16)$$

Each $x_j$ explicitly denotes an individual feature, and $L_f$ is the
number of feature vectors obtained over the total speech
duration. Note that $L_f$ is actually a function of $L_v$ since the total
duration is fixed. The sample mean $\bar{x}$ is thus independent of
$L_v$ except for sampling variation in the real-time analysis,
because it is not possible to start analysis at precisely the same
location on the audio tape when $L_v$ is changed. The true
variance $\sigma_p^2$ is estimated from $\sigma_p^2 = \sigma^2(x)$ with
$L_v = 1$. Features which themselves are based on variances (such as
$x_2 = \sigma_{F_0}$ and $x_3 = \sigma_R / \bar{R}$) do not allow for a
true variance estimate. The sample standard deviations of the
fundamental frequency-related features are shown in Fig. 3 as a
function of $L_v$. The estimated standard deviation about the
long-term fundamental frequency averages is reduced from about 18 Hz
for $L_v = 10$ to about 6 Hz for $L_v = 1000$. These values are
somewhat higher than the long term $F_0$ averages reported by
Horii [8]. However, this experiment is based upon unconstrained
conversational speech, whereas Horii's experiment was based upon a

Fig. 4. Standard deviation of gain related features as a function of the
number of voiced frames $L_v$.
Fig. 5. (a) Standard deviation of reflection coefficient averages as a
function of the number of voiced frames $L_v$ (two deviations).
(b) Standard deviation of reflection coefficient averages as a function
of the number of voiced frames $L_v$ (rms of all coefficient variances).

rapidly as predicted by sampling theory for the case of
independent samples because of intraspeaker variability, the
decrease is substantial and is surprisingly linear on a log-log
scale. Instead of a $L_v^{-1/2}$ relation, the standard deviation of
the reflection coefficient features appears to approximately
decrease according to a power-law model in $L_v$ beyond $L_v = 10$.

The rms deviation over all $\langle k_i \rangle$ averages is shown in
Fig. 5(b). Over a range of $L_v$ from 10 to 1000, the model is still
seen to be very accurate for predicting the decrease in
reflection coefficient feature parameter variation as $L_v$ is increased.
The measured exponent value is certainly dependent upon the
particular speaker. However, it appears to vary only slightly
from the model discussed for the several other speaker
measurements.
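A measured exponent of this kind can be obtained as the least-squares slope of the logarithm of the standard deviation against the logarithm of $L_v$. The following is a minimal sketch of that fit; the function name is hypothetical, and the only data used are the two fundamental frequency values quoted earlier in the text (about 18 Hz at $L_v = 10$ and about 6 Hz at $L_v = 1000$), together with an idealized independent-sample curve.

```python
import math


def loglog_slope(l_values, std_values):
    """Least-squares slope of log(std) versus log(L_v)."""
    xs = [math.log(l) for l in l_values]
    ys = [math.log(s) for s in std_values]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    # Idealized independent samples: std proportional to L_v ** -0.5.
    ideal = [l ** -0.5 for l in (10, 100, 1000)]
    print(loglog_slope([10, 100, 1000], ideal))
    # Two F0 points quoted in the text: 18 Hz at L_v = 10, 6 Hz at L_v = 1000.
    print(loglog_slope([10, 1000], [18.0, 6.0]))
```

For the two quoted fundamental frequency points the slope is roughly -0.24, noticeably shallower than the -0.5 expected for independent samples.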

The estimate of the true variance for the $k_1$, $k_{10}$, and overall
parameter variance is also shown in Fig. 5(a) and (b) at $L_v = 1$.


A second way of qualitatively showing the effect of
long-term averaging is to show two-dimensional scatter diagrams for
various values of $L_v$. Fig. 6 shows a scatter plot of
$\langle k_2 \rangle$ versus $\langle k_1 \rangle$ samples. Each point is
based upon $L_v$ samples from the edited audio tape for speaker B.
Shown with the data are two-sigma ellipses with the principal axes.
A dramatic decrease in the dispersion of the data is seen as $L_v$
increases.
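A two-sigma ellipse and its principal axes follow directly from the sample covariance matrix of the plotted points. The sketch below assumes the ellipse semi-axes are twice the square roots of the covariance eigenvalues, computed in closed form for the 2-by-2 case; the function name and the toy points are hypothetical.

```python
import math


def two_sigma_ellipse(points):
    """Return (center, semi_axes, angle) of the two-sigma ellipse of 2-D points.

    semi_axes are 2*sqrt(eigenvalue) of the sample covariance matrix
    (major axis first); angle is the direction of the major principal
    axis in radians, measured from the first coordinate axis.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = 0.5 * (sxx + syy)
    radius = math.hypot(0.5 * (sxx - syy), sxy)
    lam_major = half_trace + radius
    lam_minor = max(half_trace - radius, 0.0)
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    semi_axes = (2.0 * math.sqrt(lam_major), 2.0 * math.sqrt(lam_minor))
    return (mx, my), semi_axes, angle


if __name__ == "__main__":
    # Toy averaged-feature pairs (illustrative values only).
    pts = [(0.10, -0.30), (0.12, -0.28), (0.08, -0.33), (0.11, -0.29)]
    print(two_sigma_ellipse(pts))
```

As the averaging length grows, the covariance eigenvalues shrink and the ellipse contracts around the speaker mean.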

B. Experiment 2: Discrimination as a Function of $L_v$

The approach taken here is to investigate the effectiveness of
long-term averaging for speaker recognition using the ratio of
the between-speaker variance and the within-speaker variance,
without specifying particular speaker recognition experiments.
Since the mathematics of this procedure (Fisher discriminant
method) is discussed elsewhere [1], only the necessary details
will be summarized below.

A within-speaker covariance matrix $W$ is computed, and then
a normalized between-speaker covariance matrix $B'$ is found in
terms of the matrix $B$ of Bricker et al. [1] from

$$B' = B / L_f \qquad (17)$$

where $L_f$ is the number of feature vectors. The normalization
is included so that $B'$ will depend only upon the sample means,
not upon the number of feature vectors. Eigenvalues and
eigenvectors of the equation

$$B' v = \lambda W v \qquad (18)$$

are then obtained. The eigenvalues are ordered from highest
to lowest, and as the number of speakers, four, is less than the
number of features, thirteen, all but the first three eigenvalues
are zero [1]:

$$\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \lambda_4 = \cdots = \lambda_{13} = 0. \qquad (19)$$

A new coordinate system is defined using the eigenvectors of
(18) as base vectors, so that the new coordinate vector $y$ is
related to $x$ through the linear transformation

$$y = V^T x \qquad (20)$$

where $V$ is the matrix whose columns are the eigenvectors of
(18). The eigenvalues of (18) represent the variance ratios in
the directions of the eigenvectors with $\lambda_1$ being the maximum
variance ratio, $\lambda_2$ being the next largest (in a direction
orthogonal to the first), etc. Variance ratios can also be computed in
the original coordinate system as a method for measuring relative
effectiveness of features.

TABLE II. Variance ratios for the transformed features.

Transformed        Variance ratio
feature        L_v = 100    L_v = 1000
y_1             10.959       115.368
y_2              0.972         7.956
y_3              0.393         1.730
y_4              0             0
...              ...           ...
y_13             0             0
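For a single feature in the original coordinate system, a between-to-within variance ratio can be sketched as below. The normalization used here (variance of the speaker means divided by the average within-speaker variance, both with divisor equal to the count) is an assumption made for illustration and may differ from the exact definition in [1]; the function name and the toy values are hypothetical.

```python
def variance_ratio(groups):
    """Between-to-within speaker variance ratio for one scalar feature.

    groups: one list of feature values per speaker.
    """
    means = [sum(g) / len(g) for g in groups]
    grand = sum(means) / len(means)
    # Spread of the speaker means about the grand mean.
    between = sum((m - grand) ** 2 for m in means) / len(means)
    # Average spread of each speaker's values about that speaker's mean.
    within = sum(
        sum((v - m) ** 2 for v in g) / len(g) for g, m in zip(groups, means)
    ) / len(groups)
    return between / within


if __name__ == "__main__":
    # Two toy speakers; the second data set has tighter clusters, as would
    # be expected after longer averaging.
    print(variance_ratio([[0.0, 2.0], [10.0, 12.0]]))
    print(variance_ratio([[0.5, 1.5], [10.5, 11.5]]))
```

Tightening the clusters while leaving the speaker means in place raises the ratio, which is the effect long-term averaging is intended to produce.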

Tables I and II show the variance ratios in the original and
transformed coordinate systems, respectively, for $L_v = 100$ and
$L_v = 1000$. Except for the fact that parameter correlation is
not taken into account, the variance ratio values can be taken
as quantitative measures of the original parameter's
effectiveness in speech recognition. For example, we see that
$\sigma_{F_0}$ provides very little discrimination among speakers,
whereas $\langle k_2 \rangle$ appears to provide the maximum
discrimination among speakers over all parameters. It is seen that the
first dimension in the new coordinate system results in a
substantially increased variance ratio.


Fig. 7. Scatter plots for speakers A-D along first three Fisher
discriminant dimensions ($L_v = 100$).

Fig. 8. Scatter plots for speakers A-D along first three Fisher
discriminant dimensions ($L_v = 1000$).


Two-dimensional scatter plots of the first three transformed
dimensions are shown in Fig. 7 for $L_v = 100$ and in Fig. 8 for
$L_v = 1000$. The results are based upon the four speakers A-D.
Also shown are two-sigma ellipses and the principal axes for
each speaker distribution. In Fig. 7 it is seen that A and D are
essentially uniquely separated from B and C in at least one
plane ($y_1$ - $y_2$). A relatively large overlap does occur,
however, for B and C in all planes. A cursory comparison of Fig. 7
and the relative sizes of clusters in Fig. 6 will illustrate that
substantial benefits in discriminating among different speakers
have been obtained over using no averaging ($L_v = 1$) or very
limited amounts of averaging ($L_v = 10$).

In Fig. 8, it is seen that by performing long term averaging
with $L_v = 1000$, perfect discrimination is obtained, in this
instance based upon only a two-dimensional transformed
feature representation.

The variance ratios for the input feature variables are shown
in Table I for $L_v = 100$ and $L_v = 1000$. If the variables were
statistically independent, these ratios would differ by a
multiplicative factor of 10 rather than the smaller factors indicated
in the figure. The ordering of the features in terms of variance
ratios is of some interest. The fifth feature, $\langle k_2 \rangle$,
clearly shows the largest variance ratio, with the ninth feature,
$\langle k_6 \rangle$, the next largest. These coefficients correspond to
the coefficients for the highest power of $z^{-1}$ in the models of
order 2 and 6 found from linear prediction analysis. The two-pole
model has been used in earlier recognition tasks [18].

The variance ratios for the first three features, fundamental
frequency, its standard deviation, and sample gain variation,
are smaller than what one might expect from intuition. Part
of the reason may lie in the fact that the speakers were chosen
to have similar fundamental frequency ranges.

The variance ratios for the new coordinate system, the
eigenvalues of (18), are shown in Table II. From these ratios and
the scatter diagrams of Fig. 8, it can be seen that very clear
separation of the speakers is indicated for the long term
average case of $L_v = 1000$ by using only the first two coordinates,
$y_1$ and $y_2$, in the direction of the first two eigenvectors.
V. DISCUSSION

A. Parameter Variability Over Days, Weeks, or Months

This initial study has been restricted to the study of
long-term averages taken from one session. This is probably the
reason why the standard deviation of the long-term averages
tends to have a monotonically decreasing behavior. Although
some amount of intraspeaker variability is reflected in the data,
additional variability will occur when results are obtained from
sessions separated by days, weeks, or months. In several
studies over linguistically constrained units, this effect has
been shown to be severe beyond several months for short
text-dependent segments [4]. A large data base extending over
several months is now being generated for studying these
effects in conversational speech.

B.  Accuracy  of  Voicing  Decisions 

Since all long-term statistics are made only during voicing, it
is very important to know that realistic voicing decisions are
made. Spectral slope and normalized gain variation are direct
computations requiring no decisions (except for voicing) and
are, therefore, very robust.

If the threshold setting for voicing and pitch period detection
is set too high or too low, the effect can be catastrophic. At
one extreme, if the voicing threshold is too high, very few
frames will be included in the statistics as being voiced
(although they will be very reliable estimates) and, furthermore,
transitions in which considerable fundamental frequency
variations may occur are likely to be missed, causing the measured
fundamental frequency standard deviation to be unrealistically
small.

At the other extreme, if the threshold is too low, there will
be a tendency to define fundamental frequencies near the
maximum allowable frequency (minimum pitch period) (near
400 Hz) during actual voiced speech and at random values
throughout the rest of the allowable range during unvoiced
speech. Although a pitch period and voicing decision program
with several frames of delay is used for error detection and
correction, it is essentially impossible to separate accurate
estimates from gross errors beyond some reasonable threshold.

C. Assumptions Versus Experimental Results

It was assumed that $\langle F_0 \rangle$ carries important speaker
information. The $\sigma[\langle F_0 \rangle]$ versus $L_v$ graph in
Fig. 3 showed a significant monotonic decrease as $L_v$ was increased.
In addition, the variance ratio was relatively high (even though
speakers were purposely chosen with similar fundamental frequency
ranges). Therefore, this assumption appears valid. The assumption that
$\sigma_{F_0}$ is meaningful does not appear to be true for
conversational speech. The variance ratio for this feature is extremely
small. This result contradicts that shown by Mead [13], where
the use of the first through the fourth moments of $F_0$ and of
the first four differences of $F_0$ (resulting in 20 features) was
suggested. Our experience indicates that unless hand-marked
or hand-corrected $F_0$ contours are used, very significant biases
in results can occur because of very few gross errors in $F_0$
estimation. Higher order differences and moments only magnify
these biases.

The standard deviation of the gain deviation feature as a
function of L_v shows a weak relationship to expectations from
statistical sampling theory. In addition, the variance ratios for
the gain deviation feature are relatively small. Although some
discrimination is obtained, what we have seen is that not only
is there substantial intraspeaker variability for this parameter,
but that in addition, considerable overlap in the gain feature
values occurs between speakers. Other measures of fundamental
frequency and gain variations may prove to be more useful
than the ones used here, which are essentially based upon root
mean squares taken about the averages. One possibility is the
use of the ratio of geometric and arithmetic means as used in
evaluating spectral flatness [11].
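
As a rough illustration of the last suggestion, the ratio of the geometric mean to the arithmetic mean of a positive-valued sequence (for example, per-frame gain or fundamental frequency values) could be computed as sketched below. The function name `flatness_ratio` and the sample values are assumptions made for this example and do not come from the study.

```python
import math

def flatness_ratio(values):
    """Ratio of geometric mean to arithmetic mean of positive values.

    The ratio is 1.0 for a constant sequence and moves toward 0 as the
    values become more spread out.
    """
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be a non-empty sequence of positive numbers")
    log_mean = sum(math.log(v) for v in values) / len(values)
    arithmetic_mean = sum(values) / len(values)
    return math.exp(log_mean) / arithmetic_mean

# Hypothetical per-frame values, chosen only to show the behavior.
steady = [120.0] * 5
varied = [80.0, 100.0, 120.0, 160.0, 240.0]
print(flatness_ratio(steady), flatness_ratio(varied))
```

Unlike a root mean square taken about the average, this ratio is unchanged when every value is scaled by a common factor.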
The long-term averages of the reflection coefficients as a set
appear to be the most significant features for speaker recognition.
Not only does the standard deviation of the long-term
averages show a substantial decrease as a function of L_v, but
in addition, the variance ratios are seen to be relatively large
for most of the parameters.

D. Observations on Reflection Coefficient Averaging

Although the standard deviation of <k10> is smaller than that of
<k1> for all L_v in Fig. 5(a), one should not be misled into
thinking that <k10> is a better feature for speaker recognition.
This result occurs because k1 inherently has a larger standard
deviation than k10 (<k_i> = k_i for L_v = 1). The important fact
to note is that whatever the parameter deviation is without
averaging, due to either linguistic content or intraspeaker
variability, it decreases as L_v^(-β), where β is approximately
0.5, when long-term averaging is applied.
In a recent paper [1?], the use of orthogonal linear prediction
parameters for use in text-independent speaker recognition
studies was suggested. Although very high recognition
scores were shown using the orthogonal linear prediction
parameters, we would suggest that substantial reduction in scores
would occur if unconstrained data bases as described here were
used. Whatever scores are obtained using, in effect, L_v = 1,
our results qualitatively indicate that substantial improvements
could occur by incorporating long-term averaging.
Each orthogonal parameter, written here as \phi_i, was obtained
from a linear combination of all reflection coefficients as

    \phi_i = \sum_{j=1}^{M} c_{ij} k_j

where the c_{ij} terms were obtained from a principal component
analysis. The averaged parameters would then be

    \langle \phi_i \rangle = \sum_{j=1}^{M} c_{ij} \langle k_j \rangle.    (7)

text-independent speaker recognition tests without any linguistic
or structural constraints.
Although Fig. 6 shows only the dispersion characteristics for
<k1> and <k2>, similar characteristics are obtained for all the
coefficients. The amount of data dispersion will be primarily
due to the value of L_v, not the fact that a linear combination
of the k_i terms (or the <k_i> terms) has been obtained.

E. Computational Considerations

Studies of this type place a premium on the available processing
speed of the computer system. It became clear early
in the study that small- or medium-scale computer capability
was insufficient. For example, the analysis method described
runs in approximately 100 times real time if all operations are
implemented only on the PDP-11 system. The relatively small
data base of speakers for this study would have required over
100 hours of processing time.

Except for the nontrivial costs in software development, we
have found that attaching a high-speed processor to the main
computer system provides a very economical solution to the
requirements for real-time processing.

VI. Summary

The properties of long-term feature averaging for three sets
of fundamental frequency related, gain related, and spectrally
related parameters have been investigated. Based upon the
Fisher discriminant method, the rank ordering of the parameter
sets in importance was shown to be spectral, fundamental
frequency, and then gain. It was also shown that over a long
duration from L_v = 10 to L_v = 1000, the standard deviation of
the sample means of the reflection coefficient vectors
decreased proportionally to L_v^(-1/2).

A small number of speakers with relatively homogeneous
characteristics was used to illustrate the effects of long-term
averaging. The data base was of nontrivial duration, somewhat
greater than one hour in length. Furthermore, the text was
unconstrained conversational speech, recorded under normal
room noise conditions. Analysis was performed in real time
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Abstract 


Text-Independent  Speaker  Recognition 
from  a  Large  Linguistically  Unconstrained 
Time-Spaced  Data  Base 
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John  D.  Markel  and  Steven  B.  Davis 


A  very  large  data  base  consisting  of  over  thirty-six  hours  of 
unconstrained  extemporaneous  speech,  from  seventeen  speakers,  recorded  over 
a  period  of  more  than  three  months,  has  been  analyzed  to  determine  the 
effectiveness  of  long-term  average  features  for  speaker  recognition. 
Results are shown to be strongly dependent on the voiced speech averaging
interval L_v. Monotonic increases in the probability of correct
identification and monotonic decreases in the equal error probability for
speaker verification were obtained as L_v increased, even with substantial
time periods between successive sessions. For L_v corresponding to
approximately thirty-nine seconds of speech, text-independent results (no
linguistic constraints embedded into the data base) of 98.05% for speaker
identification and 4.25% for equal error speaker verification were
obtained.
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I.  Introduction 


In recent years, there has been an increasing interest in computer-based
techniques for text-independent speaker recognition (1-6).
Recognition is used here to encompass both speaker identification and
verification (7). The term "text-independent" has been used in several
different contexts. For example, Atal (1) has used the term in the sense
of  choosing  independent  randomized  test  frames  from  a  single  sentence  to 
use  against  the  remaining  frames  as  a  reference  set.  Sambur  (4)  has  used 
the  term  in  an  experiment  where  the  sentences  in  the  test  set  were 
different  from  those  in  the  reference  set,  even  though  each  speaker  read 
precisely  the  same  list  of  sentences. 

Although  useful  insight  has  been  gained  by  these  approaches,  they  were 
linguistically  constrained.  In  many  practical  situations,  where  text- 
independent  speaker  recognition  is  desired,  there  typically  will  be  no 
control over the speech being tested. As Beek, Neuberg and Hodge (8) have
pointed out, text-independent speaker identification can overcome problems
which may arise if the speaker is uncooperative, and there is a great
interest in speaker identification over communications channels, which
have no linguistic constraints. Furthermore, there may be days to weeks of
separation  between  reference  and  test  sessions. 

Several other studies (2,3,5,6) have analyzed data with varying
amounts of linguistic constraints. Li and Walker (2) used thirty seconds
of speech read from the rainbow passage (9) recorded once by twenty-two
male speakers and twice by an additional eight male speakers. They did not
specify the number of days separating the recordings. They demonstrated
that distances among spectral correlation matrices could be used to compare
inter-speaker and intra-speaker differences. However, the same text was
used for all tests, which could be interpreted as a linguistic constraint.

Hunt,  Yates  and  Bridle  (6)  used  approximately  six  two-  to  three-minute 
long  FM  radio  weather  forecasts  from  each  of  eleven  male  and  two  female 
speakers. Each forecast was divided into twenty- or thirty-second
intervals and long-term fundamental frequency and cepstral coefficient
features were computed for twenty-millisecond sequential frames in each
interval.  They  did  not  specify  the  number  of  days  between  successive 
forecasts  by  the  same  speaker.  Using  Fisher  discriminant  analysis  (10), 
they  achieved  89%  correct  speaker  identification  with  independent  test  and 
reference  sets.  However,  the  speakers  read  text  with  some  effort  at 
uniformity  between  sessions,  which  could  also  be  interpreted  as  a 
linguistic  constraint. 

In  a  preliminary  study,  Markel,  Oshika  and  Gray  (5)  used  one 
fifteen-  to  eighteen-minute  interview  from  each  of  four  male  speakers  with 
somewhat  similar  speech  characteristics.  The  interviews  were  recorded  with 
an  audio  tape  recorder  in  a  normal  room  environment.  Long-term  fundamental 
frequency,  gain  and  reflection  coefficient  features  were  computed  for  every 
1000  sequential  voiced  frames  (twenty-millisecond  windows  per  frame,  fifty 
frames  per  second)  in  each  interview.  Using  the  same  Fisher  discriminant 
analysis (10) as Hunt et al. to transform the data, they achieved perfect
discrimination among the four speakers. These recorded interviews were
considered to be free of linguistic constraints. However, the data were
insufficient to obtain statistically significant results, and with only one
session per speaker, there was no analysis of speaker characteristics over
time.


The  purpose  of  this  paper  is  to  present  results  from  experiments  in 
speaker  recognition  where  there  were  no  linguistic  constraints  on  the 
speech  content  (other  than  the  ones  implied  when  the  speaker  is 
cooperative,  and  English  is  used).  In  comparison  with  the  previous  study 
(5),  results  are  presented  for  a  larger  number  of  speakers,  for  multiple 
sessions  from  each  speaker,  and  for  a  greater  number  of  features. 
Furthermore,  the  effects  of  time  between  recording  sessions  are  studied. 
For practical implementation, only parameters obtained from the analysis
portion of a linear prediction vocoder (fundamental frequency, gain and
reflection coefficients) were used. (Beek et al. (8) have stated that the
reflection coefficients are currently favored for all-digital narrowband
communications systems.) This study shows that if these parameters are
averaged over sufficiently long intervals of time, such as thirty seconds
or more, the features obtained are essentially free of linguistic
constraint, and speaker recognition performance is comparable with some
text-dependent speaker recognition experiments. The linguistic results
agree with Li and Walker (2), who used a smaller data base; long-term
speech features are relatively stable after thirty seconds. Furthermore,
this study shows that if the averaging interval is too short, speaker
recognition performance is unacceptable with linguistically unconstrained
extemporaneous speech. In addition, the importance of having a time-spaced
reference set of sufficient size is demonstrated.
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II.  Data  Base  and  Processing  Methodology 


A  data  base  was  collected  by  recording  170  fifteen-minute  interviews 
from  eleven  male  and  six  female  speakers.  There  were  ten  sessions  per 
speaker,  with  each  session  separated  by  a  minimum  of  one  week.  Generally, 
the  successive  sessions  were  obtained  within  two  to  three  weeks.  One 
exceptional  separation  between  successive  sessions  was  fourteen  weeks. 

All  sessions  were  recorded  on  a  Tandberg  9000X  two-track  tape  recorder 
at  a  recording  speed  of  7.5  ips.  One  track  was  used  to  record  the 
interviewer  and  the  other  track  was  used  to  record  the  speaker.  The 
speaker was recorded with a B and K half-inch condenser microphone and
amplifier system in an IAC sound room equipped with a window. The
interviewer was recorded with a conventional dynamic microphone outside of
the sound room. Two-way communication was established using headphones.

Each  session  began  with  the  speaker  reciting  his/her  name,  a  password, 
a word list and the first paragraph of the rainbow passage (9). The
interviewer  posed  a  topic  to  the  speaker,  and  the  remaining  time  (generally 
twelve  to  thirteen  minutes)  was  devoted  to  an  extemporaneous  monolog  by  the 
speaker.  The  interviewer  responded  briefly  when  appropriate,  or  when  it 
was  necessary  to  ask  a  new  question  for  continuity. 

A wide range of topics was covered, from describing a job to
describing a frightening experience. Although one might argue that this
approach in some sense constrained the data, casual listening to the
recordings demonstrates that this is not the case. The topics generally
provided a springboard for the speaker's thoughts, and the speech was
usually conversational, fluent and quite varied. (With one subject, the
suggested topic was consistently replaced by a wide variety of topics.)

Several  observations  should  be  noted  which  may  be  of  considerable 
importance  in  practical  situations.  After  the  initial  recording  gain 
calibration  for  each  session,  no  further  gain  adjustments  were  made. 
Subjects  occasionally  became  bored  or  distracted,  and  either  lowered  their 
voice  intensity  or  turned  their  heads  away  from  the  microphone. 
Conversely, subjects occasionally became intense on a topic and nearly
"swallowed" the microphone, resulting in substantial low frequency waveform
variability due to breath bursts. Also, there was some stuttering, throat
clearing,  laughter,  giggling  and  poor  articulation. 

In  addition  to  these  conditions,  about  half  of  the  subjects  acquired 
various  degrees  of  colds  during  a  two  to  three  week  period.  All  of  these 
cases  were  recorded  in  the  normal  fashion,  and  no  hand  editing  or  deletion 
of  any  data  was  performed.  The  data  used  in  this  study  consisted  of  only 
the  extemporaneous  speech  material  from  the  speakers,  excluding  the  rainbow 
passage,  word  lists,  etc.  The  total  duration  of  the  data  base  is  17 
speakers  x  10  sessions/speaker  x  approximately  13  minutes/session,  or 
approximately  36.8  hours  of  data. 
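
The duration figure above follows directly from the three quantities quoted in the text. A minimal check of the arithmetic is sketched below; the constant and function names are chosen for this example only.

```python
SPEAKERS = 17
SESSIONS_PER_SPEAKER = 10
MINUTES_PER_SESSION = 13  # approximate extemporaneous portion of each session

def total_hours(speakers, sessions, minutes_per_session):
    """Total data base duration in hours."""
    return speakers * sessions * minutes_per_session / 60.0

hours = total_hours(SPEAKERS, SESSIONS_PER_SPEAKER, MINUTES_PER_SESSION)
print(f"{hours:.1f} hours")  # prints "36.8 hours"
```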

Several large population and long duration data bases have been
reported in the literature (10,11). These were all text-dependent studies
with short names or phrases. However, even the total duration of the large
data base used by Das and Mohn is only one-tenth the total duration of the
data base used in this study. The magnitude of this data base was
extremely valuable for choosing feature subsets and defining reference sets
which spanned varying periods of time.

Each  audio  tape  was  manually  cued  to  the  location  where  the 
extemporaneous  portion  of  the  interview  began.  Then  real-time  linear 
prediction  analysis  and  disk  storage  of  the  analysis  parameters  was 
initiated.  The  data  were  low  pass  filtered  at  3250  Hz  and  sampled  at  a 
6500  Hz  rate  for  compatibility  with  future  applications  to  telephone 
systems and narrowband vocoder systems. The speech samples were
preemphasized with a factor of 0.9, successive 128-point frames were
multiplied by a Hamming window, and the autocorrelation method of linear
prediction was used at a rate of fifty frames per second. The analysis was
performed in real time under Fortran control using a commercially available
array processing system in conjunction with a PDP 11/45 computer (4,12).
The analysis parameters for each speech frame were ten reflection
coefficients, pitch period (obtained from a modified cepstral pitch
tracker) and gain, and were stored in a quantized format of eight bits (one
byte) per parameter. The process was terminated when the end of the tape
was reached (defined as a thirty-second silence interval). The processing
of each interview resulted in an analysis file of approximately 1000 disk
blocks (512 bytes/block), and all interviews together required nearly half
the total space of a 200-Mbyte disk (340,670 formatted disk blocks). In
comparison, it would require ten 200-Mbyte disks to digitize all of the
interviews with 12 bits/sample and to store them directly without
preprocessing.
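
The spectral part of the analysis chain described above (preemphasis, Hamming windowing, autocorrelation method) can be sketched in a few lines. This is only an illustrative reconstruction under stated assumptions: the sign convention for the reflection coefficients, the synthetic test signal, and all function names are choices made for this example, and the pitch tracker, gain computation and eight-bit quantization of the original system are not reproduced.

```python
import math
import random

def preemphasize(samples, factor=0.9):
    """y[n] = x[n] - factor * x[n-1], with x[-1] taken as 0."""
    out, prev = [], 0.0
    for x in samples:
        out.append(x - factor * prev)
        prev = x
    return out

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one windowed frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def reflection_coefficients(r, order):
    """Levinson-Durbin recursion on r[0..order].

    Assumed convention: inverse filter A(z) = 1 + a_1 z^-1 + ... + a_M z^-M,
    with k_m the last coefficient of the order-m filter (k_1 = -r[1]/r[0]).
    """
    a = [1.0] + [0.0] * order
    error = r[0]
    ks = []
    for m in range(1, order + 1):
        if error <= 0.0:
            # Degenerate (e.g. silent) frame: pad remaining coefficients.
            ks.extend([0.0] * (order - m + 1))
            break
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / error
        ks.append(k)
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a = new_a
        error *= 1.0 - k * k
    return ks

def analyze_frame(frame, order=10):
    """Window one frame and return its reflection coefficients."""
    windowed = [s * w for s, w in zip(frame, hamming(len(frame)))]
    return reflection_coefficients(autocorrelation(windowed, order), order)

SAMPLE_RATE = 6500               # Hz, as in the text
FRAME_LENGTH = 128               # samples per frame, as in the text
FRAME_STEP = SAMPLE_RATE // 50   # 130 samples between frame starts at 50 frames/s

# Synthetic stand-in for speech: two tones plus a little seeded noise.
rng = random.Random(0)
signal = [math.sin(2.0 * math.pi * 200.0 * n / SAMPLE_RATE)
          + 0.5 * math.sin(2.0 * math.pi * 1100.0 * n / SAMPLE_RATE)
          + 0.05 * rng.gauss(0.0, 1.0)
          for n in range(650)]
emphasized = preemphasize(signal)
frames = [emphasized[s:s + FRAME_LENGTH]
          for s in range(0, len(emphasized) - FRAME_LENGTH + 1, FRAME_STEP)]
coeffs = [analyze_frame(f) for f in frames]
print(len(coeffs), "frames,", len(coeffs[0]), "coefficients each")
```

With the autocorrelation method every reflection coefficient has magnitude below one, which is part of what makes an eight-bit quantized storage format workable.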


Next, the analysis files were used to obtain long-term feature
vectors, where each vector was the average of successive voiced analysis
frames. Unvoiced and silence frames were not included in this study, since
it was felt that fundamental frequency was an essential speaker-dependent
parameter. The vocoder analysis parameters consisted of fundamental
frequency (the reciprocal of the pitch period), gain and ten reflection
coefficients. For every interval L_v, long-term features based upon the
mean, standard deviation and dispersion (standard deviation divided by
mean) of the twelve parameters were computed, resulting in
thirty-six-dimensional feature vectors. This feature set was defined in a
reasonably general manner since analytic techniques for feature reduction
may be used to find the most reasonable feature subsets for speaker
recognition.
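
The construction of the thirty-six-dimensional vectors can be sketched as follows. Several details are assumptions made for this example rather than facts from the study: the population (divide-by-N) form of the standard deviation, the dropping of an incomplete final block, the value 0.0 used for the dispersion when a mean is exactly zero, and the function and variable names.

```python
import math
import random

def long_term_features(voiced_frames, block_length):
    """Return one feature vector per full block of voiced frames.

    voiced_frames: list of equal-length parameter lists (voiced frames only).
    block_length:  number of voiced frames averaged per feature vector.
    Each vector holds the means, then the standard deviations, then the
    dispersions (standard deviation / mean) of every parameter.
    """
    if block_length <= 0:
        raise ValueError("block_length must be positive")
    n_params = len(voiced_frames[0]) if voiced_frames else 0
    vectors = []
    for start in range(0, len(voiced_frames) - block_length + 1, block_length):
        block = voiced_frames[start:start + block_length]
        means, stds, dispersions = [], [], []
        for p in range(n_params):
            column = [frame[p] for frame in block]
            mean = sum(column) / block_length
            variance = sum((x - mean) ** 2 for x in column) / block_length
            std = math.sqrt(variance)
            means.append(mean)
            stds.append(std)
            dispersions.append(std / mean if mean != 0.0 else 0.0)
        vectors.append(means + stds + dispersions)
    return vectors

# Hypothetical data: 250 voiced frames of 12 parameters each.
rng = random.Random(1)
demo_frames = [[rng.gauss(1.0, 0.2) for _ in range(12)] for _ in range(250)]
demo_vectors = long_term_features(demo_frames, 100)
print(len(demo_vectors), "vectors of dimension", len(demo_vectors[0]))
```

Because blocks are formed by counting voiced frames rather than elapsed time, the number of vectors obtained from an interview depends on how much voiced speech it contains.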

A summary of the number of feature vectors produced for all 170
interviews is given in Table 1. In this table, the data are partitioned
into representative test and reference sets (13). Four choices of L_v were
studied, namely L_v = 30, 100, 300 and 1000. The total number of feature
vectors and the average real-time interval per feature vector as functions
of L_v are also given.

TABLE  1  GOES  HERE 


It is important to consider the relationship between a particular
value of L_v and the real-time interval of a long-term feature vector. Most
significantly, a fixed number of voiced frames, rather than all of the
voiced frames from a fixed elapsed-time interval, was chosen for analysis.
With extemporaneous speech, there may be intervals of ten to twenty seconds
where  very  little  or  no  voiced  speech  occurs  (the  speaker  may  pause,  cough, 
laugh,  etc.),  leading  to  a  variable  voiced  frame  rate.  If  long-term 
features  were  a  function  of  the  voiced  frame  rate,  then  such  features  would 
not  be  reflective  of  only  a  speaker's  speech  sounds,  but  also  his/her 
speech  rate  and  style.  While  these  additional  characteristics  might  be  a 
source  of  speaker-dependent  information,  they  were  not  considered  in  this 
study,  and  consequently  long-term  features  were  made  independent  of  the 
voiced  frame  rate. 

The real-time interval for a long-term feature (seconds/feature)
corresponds to a product of the following factors: 1) the number of voiced
frames per feature vector (L_v), 2) the reciprocal of the voiced frame to
total frame ratio (or the reciprocal of the voicing duty factor), and
3) the reciprocal of the number of analysis frames per second (or the
reciprocal of the frame rate). In a previous study (5), the voicing
threshold was set such that very smooth fundamental frequency (F0) contours
were observed on a real-time display system, and as a result, L_v = 1000
corresponded to approximately seventy seconds of real speech. For this
study, the voicing threshold was determined by synthesizing the speech
using the F0 contour obtained, and then selecting the threshold that
produced the subjectively best synthesis. The ear appears more sensitive
to voiced speech segments which are synthesized as unvoiced, rather than
the reverse, i.e. buzziness is typically preferred over whispery or hoarse
speech. As a result, more voiced decisions were made, and L_v = 1000 in
this study corresponded to approximately thirty-nine seconds of speech.
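
The three-factor product can be turned into a small calculation. The sketch below infers the average voicing duty factor implied by the two quoted durations (thirty-nine and seventy seconds for 1000 voiced frames at fifty frames per second); these inferred duty factors are derived values, not figures reported in the text, and the function names are hypothetical.

```python
def seconds_per_feature(voiced_frames, voicing_duty_factor, frames_per_second=50):
    """Real-time interval spanned by one long-term feature vector."""
    return voiced_frames / (voicing_duty_factor * frames_per_second)

def implied_duty_factor(voiced_frames, seconds, frames_per_second=50):
    """Voiced-to-total frame ratio implied by an observed interval."""
    return voiced_frames / (seconds * frames_per_second)

duty_this_study = implied_duty_factor(1000, 39)      # roughly 0.51
duty_previous_study = implied_duty_factor(1000, 70)  # roughly 0.29

# Under the same average duty factor, 30 voiced frames span about 1.2 s.
short_interval = seconds_per_feature(30, duty_this_study)
print(round(duty_this_study, 2), round(duty_previous_study, 2), round(short_interval, 2))
```

The drop from seventy to thirty-nine seconds therefore corresponds to a change in the voicing threshold, not to any change in the frame rate.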


The feature vectors for each interview for each of the above values of
L_v required approximately 301, 93, 33 and 13 disk blocks respectively, and
a  total  of  74,800  disk  blocks  were  required  to  store  the  feature  vectors 
for  the  various  conditions  for  the  170  interviews.  These  data  were  then 
further  processed  as  described  in  the  next  section. 
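
The storage figures quoted in this section are mutually consistent, as the following check suggests. The thirteen-minute duration used for the analysis file estimate is an approximation taken from the session description, and the helper name is hypothetical.

```python
BLOCK_BYTES = 512
DISK_BLOCKS = 340_670   # formatted blocks on one 200-Mbyte disk
INTERVIEWS = 170

def analysis_blocks(minutes, params_per_frame=12, frames_per_second=50):
    """Disk blocks for one analysis file at one byte per parameter."""
    total_bytes = minutes * 60 * frames_per_second * params_per_frame
    return total_bytes / BLOCK_BYTES

# Blocks per interview for the four averaging intervals, from the text.
feature_blocks = {30: 301, 100: 93, 300: 33, 1000: 13}

total_feature_blocks = INTERVIEWS * sum(feature_blocks.values())
analysis_fraction = INTERVIEWS * 1000 / DISK_BLOCKS

print(analysis_blocks(13))    # about 914, i.e. on the order of 1000 blocks
print(total_feature_blocks)   # 74800
print(round(analysis_fraction, 3))
```

The feature vectors thus occupy well under half the space of the analysis files from which they were derived.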

III.  Experiments  in  Parameter  Variability 

A.  Intra-Speaker  Variability 

In a previous study (5), the within speaker (intra-speaker) variability
of the features for one male speaker was demonstrated to be a monotonically
decreasing function of L_v from L_v = 1 to L_v = 1000 for a single fifteen
minute session. Using the data base in this study, it was possible to
study the intra-speaker variability for a larger number of male and female
speakers, and in addition, it was possible to study the intra-speaker
variability for cumulative sessions. If individual sessions are described
by S(i), i=1,...,10, then cumulative sessions may be described by C(i),
i=1,...,10, where C(i) = S(1)+S(2)+...+S(i).

The standard deviations of the long-term averages of the fundamental
frequency and the first reflection coefficient, denoted as σ<F0> and σ<k1>
respectively, as measured over the cumulative sessions C(i) for one male
and one female speaker, are shown in Figure 1.

For both speakers and for each set of cumulative sessions, σ<F0>
decreases as L_v increases. This behavior demonstrates that over long
intervals, a speaker's average fundamental frequency is (probably) a good
estimator of a characteristic or "habitual" value, and for successive long
intervals, the deviation from the habitual value is small. For short
intervals, influences such as speech prosody may mask the habitual value,
and successive short intervals will deviate more widely from each other.
This concept of habitual fundamental frequency is paralleled by the concept
of habitual (perceived) pitch; the latter is used in speech therapy as a
measure of acoustic improvement during treatment of a functional or organic
voice disorder (14), and is an important factor in listener-based speaker
recognition. For both speakers and for each value of L_v, there is a trend
for σ<F0> to increase as more sessions are included (although there are
exceptions, e.g. for the female speaker, σ<F0> for C(1) is greater than
σ<F0> for C(2)). The dependence of σ<F0> on L_v may approximately be
described as proportional to L_v^(-1/2), which agrees with the theoretical
relationship between the variance of a set of samples of a stationary
random process, e.g., the samples of F0, and the variance of the process
(5). In absolute terms, the standard deviation of the long-term
fundamental frequency averages, over a time span of more than three months,
varies from 17-23 Hz at L_v = 30 to 4-8 Hz at L_v = 1000 for the male
speaker, and from 28-33 Hz at L_v = 30 to 8-11 Hz at L_v = 1000 for the
female speaker.
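
The inverse-square-root dependence on the averaging length can be illustrated with a small simulation. The sketch assumes statistically independent, Gaussian frame values, which real fundamental frequency contours are not, so it shows only the idealized sampling-theory behavior; the mean, standard deviation and names used are hypothetical.

```python
import math
import random

def std_of_block_means(samples, block_length):
    """Standard deviation of the means of consecutive full blocks."""
    means = [sum(samples[i:i + block_length]) / block_length
             for i in range(0, len(samples) - block_length + 1, block_length)]
    grand_mean = sum(means) / len(means)
    return math.sqrt(sum((m - grand_mean) ** 2 for m in means) / len(means))

rng = random.Random(0)
FRAME_STD = 25.0   # hypothetical per-frame standard deviation in Hz
samples = [rng.gauss(120.0, FRAME_STD) for _ in range(200_000)]

observed = {n: std_of_block_means(samples, n) for n in (30, 100, 300, 1000)}
for n, value in observed.items():
    predicted = FRAME_STD / math.sqrt(n)
    print(n, round(value, 2), round(predicted, 2))
```

Correlation between neighboring frames, as in natural speech, would be expected to slow the decrease relative to this idealized case.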

The behavior of σ<k1> as L_v increases mirrors the behavior of σ<F0> as
L_v increases. Since the value k1 is a monotonic function of spectral
slope of a first-order linear prediction inverse filter for speech (5,15),
then a parallel explanation in terms of "habitual spectral slope" may be
given, i.e., the longer the interval, the better the estimate of the
habitual spectral slope. However, as more sessions are included, the
behavior of σ<k1> differs from the behavior of σ<F0>. For a given L_v,
there is essentially no measurable increase in variability as the time
period increases from one fifteen minute session to a period of nearly
three months, with all ten sessions included. This trend is observed for
the other speakers and the other long-term reflection coefficient averages,
thus substantiating the presence of an "habitual spectral characteristic"
for each speaker. Since the reflection coefficients are used to describe
the vocal tract shape in an acoustic tube model (16), the result implies
that the physical characteristics of a subject's vocal tract show no
observable changes over at least several months.

Furui et al. (17-20) have examined speaker variability over intervals
from a few weeks to several years. Their studies dealt with the
variability of repeated word lists and short sentences. They found that
for increasing time intervals from about three weeks to three months,
spectral parameters such as reflection (PARCOR) or cepstral coefficients
showed increasing variation. In contrast, the standard deviations of the
reflection coefficients in this study show essentially no variation over
time. Perhaps the data of Furui et al. were too linguistically
constrained, and speakers never approached their habitual spectral
characteristic.

In summary, intra-speaker variability based on averaged features
decreases monotonically as the averaging interval increases. Furthermore,
for a large averaging interval, intra-speaker feature variability is
relatively consistent over a time period of three months. The next aspect
of this study is a comparison which includes the inter-speaker information,
e.g. a feature-by-feature analysis which uses the values of each feature
from all subjects. If some features have small inter-speaker variance
compared to the intra-speaker variance, then those features will not be
useful for speaker recognition, and the performance of a classifier
designed to recognize speakers from these features may be poor.


B.  Variance  Ratio  Analysis 


One method of measuring the usefulness of a feature for speaker
recognition is the F-ratio or variance ratio (also referred to as the
generalized Fisher discriminant) (7,10,19). The variance ratio of a
feature is the quotient of the inter-speaker variance and the intra-speaker
variance (11). In general, the larger the variance ratio for a particular
feature, the greater the probable contribution of the feature in
distinguishing the speakers (13), but this property is strongly dependent
on the data and the experimental procedure. However, the variance ratio
does not account for inter-feature correlations, and if two features with
high variance ratios are highly correlated, then the inclusion of both
parameters might be somewhat redundant (7).
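
A minimal sketch of a variance ratio computation for a single feature is given below. It takes the inter-speaker variance as the variance of the speaker means and the intra-speaker variance as the average of the per-speaker variances, both with divide-by-N normalization; the exact normalization used in the study is not stated here, so these choices, like the names and sample values, are assumptions.

```python
def variance_ratio(values_by_speaker):
    """Inter-speaker variance divided by mean intra-speaker variance.

    values_by_speaker: dict mapping a speaker label to a list of values
    of one long-term feature for that speaker.
    """
    speaker_means = {s: sum(v) / len(v) for s, v in values_by_speaker.items()}
    n_speakers = len(speaker_means)
    overall_mean = sum(speaker_means.values()) / n_speakers
    inter = sum((m - overall_mean) ** 2 for m in speaker_means.values()) / n_speakers
    intra = sum(
        sum((x - speaker_means[s]) ** 2 for x in v) / len(v)
        for s, v in values_by_speaker.items()
    ) / n_speakers
    return inter / intra

# Hypothetical feature values: one well-separated case, one overlapping case.
separated = {"A": [100.0, 102.0, 98.0],
             "B": [150.0, 151.0, 149.0],
             "C": [200.0, 203.0, 197.0]}
overlapping = {"A": [100.0, 140.0, 180.0],
               "B": [110.0, 150.0, 190.0],
               "C": [120.0, 160.0, 200.0]}
print(variance_ratio(separated), variance_ratio(overlapping))
```

Because the ratio is computed one feature at a time, it cannot reveal that two highly ranked features carry largely the same information.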


1. Trends as a function of population

The variance ratios for the case L_v = 1000
and cumulative sessions 1-10 are shown in Figure 2 for the male and
female speakers separately, and in Figure 3 for two subsets of the male
speakers. Only the variance ratios of the mean and standard deviation
features are shown. The variance ratios of the dispersion features were
consistently low, and therefore believed to contribute very little toward
speaker recognition in this study.

There are noticeable differences in the variance ratios between the
male  and  female  populations.  Based  on  relative  magnitudes,  the  features 

<  (l<g)>,  <  (k0)>  and  <k2>  would  be  the  most  significant  for  identifying  the 
male  population,  while  <  (k^)>,  <  (kg>  and  <kg>  would  be  the  most 

significant  for  identifying  the  female  population.  If  the  male  population 
is  arbitrarily  divided  into  two  equal-sized  subsets,  there  are  pronounced 
changes in the variance ratios. For the first set of male speakers, <k1>,
<F0> and <k2> have the largest variance ratios, and for the second set of
male  speakers,  <k^>,  <kg>  and  <k3>  have  the  largest  variance  ratios.  These 
results  show  the  need  to  have  a  substantially  larger  speaker  population  in 
order  to  characterize  the  parameters  of  major  importance.  However,  it  is 
estimated  that  to  obtain  variance  ratios  which  would  exhibit  consistent 
trends  for  a  set  of  speakers  and  for  subsets  of  the  speakers,  a  much  larger 
data  base,  possibly  more  than  100  speakers,  would  be  required. 

In  the  previous  paper  (5),  for  a  smaller  and  more  homogeneous  data 
base,  <k2>  and  <kg>  were  found  to  be  the  most  significant  parameters. 
These  large  variance  ratios  would  be  physical  evidence  for  the  importance 
of the first and third formants in voiced speech (5). This larger
population base, however, shows no such relationships. The conclusion is
that  for  studies  with  linguistically  unconstrained  speech,  parameter 
ranking  using  variance  ratios  should  be  used  cautiously.  The  parameters 
with  large  variance  ratios  may  change  depending  on  how  the  data  are 
partitioned,  and  the  features  with  small  variance  ratios  may  be  important 
for  achieving  good  speaker  recognition  if  the  data  partitioning  is  changed. 
(Conversely,  it  will  be  shown  that  some  parameters  with  small  variance 
ratios  may  actually  degrade  speaker  recognition.) 


2. Trends as a function of L_v and time-spacing

The variance ratios were determined for the case L_v = 100 and
cumulative sessions 1-10 (Figure 4), and for the case L_v = 1000 and
cumulative sessions 1-2 (Figure 5). Comparing Figures 2 and 4, which only
differ by the averaging interval L_v, the variance ratios generally maintain
the same relative relationships, i.e. the features which have the
relatively larger variance ratios for L_v = 1000 also have the relatively
larger variance ratios for L_v = 100. However, the absolute values of the
variance ratios are smaller for L_v = 100 than for L_v = 1000. Comparing
Figures 2 and 5, which only differ by the number of sessions, the relative
relationships and the absolute values of the variance ratios are similar
for two cumulative sessions and for ten cumulative sessions. However, a
slight decrease in the absolute values of the variance ratios for ten
cumulative sessions is observed. If the inter-speaker variance is assumed
relatively constant for two or ten cumulative sessions, then the slight
decrease in variance ratios for ten sessions over two sessions correlates
with the slight increase in standard deviations observed in Figure 1. This
result further establishes that a speaker's habitual features, when
measured over a relatively long interval (greater than thirty seconds), do
not show appreciable changes over time periods up to three months.


3.  Further  observations 


It  is  also  evident  that  the  variance  ratios  for  the  mean  features 
generally  have  larger  values  than  the  corresponding  variance  ratios  for  the 
standard  deviation  features.  The  variance  ratios  for  the  dispersion 
features  are  in  turn  substantially  lower  in  value  than  the  corresponding 
variance  ratios  for  the  standard  deviation  features.  Features  based  upon 
gain  have  consistently  small  variance  ratios. 


IV.  Speaker  Recognition 


Speaker recognition was based on a weighted Euclidean distance metric
(5,7,11), where the mean vector and inverse covariance matrix for each of
the seventeen speakers were estimated from feature vectors in the specified
reference set. All thirty-six dimensions were used initially. The
distances between each reference class and each test vector were computed,
and the test vector was assigned to the reference class which yielded the
smallest distance. For speaker identification, a tally was taken of the
number of correct choices. For speaker verification, the distances were
stored for further analysis with a variable distance threshold. The method
of cross-validation in both directions was used (11), where independent
subsets of the data were cyclically treated as test and reference groups,
and the speaker recognition scores for each cycle were averaged for the
final scores.
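
The procedure just described can be sketched as follows. This is a minimal illustration, not the study's implementation: it assumes each reference class is summarized by its mean vector and the inverse of its own sample covariance matrix, and it uses synthetic, well-separated data. All function names and the data generator are hypothetical.

```python
import numpy as np

def train_references(vectors, speaker_ids):
    """Mean vector and inverse covariance matrix for each reference speaker."""
    refs = {}
    for s in np.unique(speaker_ids):
        x = vectors[speaker_ids == s]
        refs[s] = (x.mean(axis=0), np.linalg.inv(np.cov(x, rowvar=False)))
    return refs

def distances(refs, x):
    """Weighted distance from test vector x to every reference class."""
    return {s: float((x - m) @ inv_cov @ (x - m)) for s, (m, inv_cov) in refs.items()}

def identify(refs, x):
    """Assign x to the reference class with the smallest distance."""
    d = distances(refs, x)
    return min(d, key=d.get)

def two_way_score(a_vecs, a_ids, b_vecs, b_ids):
    """Identification score averaged over both reference/test directions."""
    scores = []
    for ref_v, ref_i, test_v, test_i in ((a_vecs, a_ids, b_vecs, b_ids),
                                         (b_vecs, b_ids, a_vecs, a_ids)):
        refs = train_references(ref_v, ref_i)
        hits = sum(identify(refs, x) == s for x, s in zip(test_v, test_i))
        scores.append(hits / len(test_i))
    return float(np.mean(scores))

def make_half(rng, n_speakers=3, n_vectors=50, n_dims=4):
    """Synthetic stand-in for one half of the sessions (hypothetical data)."""
    ids = np.repeat(np.arange(n_speakers), n_vectors)
    vecs = rng.normal(size=(ids.size, n_dims)) + 5.0 * ids[:, None]
    return vecs, ids

rng = np.random.default_rng(0)
a_vecs, a_ids = make_half(rng)
b_vecs, b_ids = make_half(rng)
score = two_way_score(a_vecs, a_ids, b_vecs, b_ids)
print(f"two-way P(CI) = {score:.3f}")
```

For verification, the values returned by `distances` would be kept and compared against a variable threshold rather than reduced to a single minimum.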

Atal (7) and Bricker et al. (10) discussed three possible choices for
a distance metric. Each metric was a positive semidefinite form which
could be described by d = (X - Y_i) M (X - Y_i)^T, where X was a vector to
be classified, Y_i was the mean vector for class i, and M was a weighting
matrix. The choices for M were a pooled covariance matrix from all
speakers, an individual covariance matrix from each speaker, or a
discriminant matrix D composed of the eigenvectors of B, where B was
the between-class covariance matrix.

The use of the discriminant matrix D requires sufficient knowledge of
the inter-speaker variability, which may be difficult to attain unless an
extremely large number of speakers is used. Atal (7) and Bricker
et al. (10) preferred the pooled covariance matrix W over the individual
covariance matrix W_i. Their rationale was that data limitations (fewer
samples than dimensions) frequently result in a singular (noninvertible)
covariance matrix, and that one pooled covariance matrix would adequately
represent all speakers, even though speaker-dependent data is contained in
the individual covariance matrices and subsequently is not used.

From Table 1, the average number of feature vectors per speaker per
session for L_v = 30, 100, 300, 1000 is 685 (116411/170), 205 (34862/170),
68 (11563/170) and 20 (3413/170) respectively. For L_v = 30, 100 or 300,
with thirty-six dimensions, the individual covariance matrices were never
singular for any number of pooled sessions. For L_v = 1000, with thirty-six
dimensions, the individual covariance matrices were singular if fewer than
three sessions were pooled. Furthermore, Kanal (14) has suggested that ten
times the number of dimensions is an adequate sample size for good
covariance matrix estimates with normal probability distribution
assumptions. For five sessions and thirty-six dimensions in a reference
class, the factors which relate sample size to dimensionality for
L_v = 30, 100, 300, 1000 are 95 (685*5/36), 28 (205*5/36), 9 (68*5/36) and
3 (20*5/36) respectively. For L_v = 1000, sessions as long as forty-five
minutes would have been necessary to produce a factor near ten, but a
factor as large as ten is probably not needed for features which are
themselves the average of 1000 frames of data. However, fifteen minutes
was a sufficient duration for the other values of L_v, as well as an upper
limit of endurance for the subject and interviewer. It was felt that the
advantages gained through the use of individual covariance matrices
outweighed potential problems of undersampling the speaker's statistics.
In a practical situation, relatively long sessions would be necessary for
sufficient accumulation of a speaker's reference data, but thereafter the
speaker could be verified approximately every thirty-nine seconds.
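
The sample-size arithmetic above can be reproduced with a few lines. The totals are the Table 1 figures quoted in the text; 170 is taken to be the seventeen speakers times ten sessions, and the variable names are illustrative only.

```python
# Total feature vectors for each averaging length L_v, as quoted from Table 1.
TOTAL_VECTORS = {30: 116411, 100: 34862, 300: 11563, 1000: 3413}
SPEAKER_SESSIONS = 17 * 10   # seventeen speakers, ten sessions each
DIMENSIONS = 36              # full feature set
REFERENCE_SESSIONS = 5       # sessions pooled into one reference class

# Average number of feature vectors per speaker per session.
per_session = {l_v: round(n / SPEAKER_SESSIONS) for l_v, n in TOTAL_VECTORS.items()}

# Ratio of reference sample size to dimensionality (Kanal's rule suggests ~10).
factors = {l_v: n * REFERENCE_SESSIONS / DIMENSIONS for l_v, n in per_session.items()}

for l_v in TOTAL_VECTORS:
    print(f"L_v = {l_v:4d}: {per_session[l_v]:3d} vectors/session, "
          f"sample-to-dimension factor = {factors[l_v]:.1f}")
```

The printed factors round to the 95, 28, 9 and 3 given in the text, which makes clear that only L_v = 1000 falls well short of the ten-to-one guideline.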


A.  Trends as a function of L_v


For the first series of tests, the first five sessions were treated as
the reference data, the second five sessions were treated as the test data,
and then vice-versa. Results are shown in Table 2. In Table 2a, it is
seen that the average scores for the probability of correct identification
P(CI) monotonically increase from 60% to nearly 92% as L_v increases from
30 to 1000. A confusion matrix of identification errors shows that no one
speaker is more difficult to identify than any other speaker. In Table 2b,
as L_v increases, the speaker verification equal error probability
(probability of false acceptance P(FA) equals the probability of false
rejection P(FR)) monotonically decreases from 43.1% to 8.8%. This
trend is principally due to the P(FA) behavior, since the P(FR) behavior
does not change appreciably with L_v (5). Although the distance threshold
for a given probability of correct acceptance and fixed dimensionality
(under multivariate normal assumptions) may be analytically obtained, the
distance threshold for the equal error probability can only be determined
experimentally. In Table 2c, the equal error probability distance
threshold is seen to monotonically increase as L_v increases.


TABLE 2 GOES HERE


It is interesting to illustrate the difference between text-independent
speaker recognition with and without linguistic constraints.
Sambur has proposed an orthogonal linear prediction set of parameters for
text-independent speaker recognition (4). Within the context of a
linguistically constrained experiment where all speakers spoke the same set
of sentences, Sambur's text-independent results (in the sense that the
reference sentences were different from the test sentences) were near 94%.
The orthogonal linear prediction parameters are essentially equivalent to a
linear transformation of the long-term reflection coefficient averages
used in this study if L_v = 1 (equivalent to no averaging). If all
linguistic constraints are removed, and if little or no averaging is used,
the results of Table 2 indicate that the speaker identification scores for
a true text-independent situation with a reasonable number of speakers will
be quite poor (even for L_v = 30, P(CI) is bounded from above at 62%). A
similar statement follows for the case of speaker verification.


B.  Trends  as  a  function  of  time  spacing 


Rosenberg (21) has noted that one of the most important considerations
in designing a data base is the time period over which utterances are
collected and the methods for establishing reference patterns over time.
Following the pictorial scheme of Furui et al. (14-17) for illustrating
reference and test sets over time, speaker recognition for the four cases
shown in Figure 6 was investigated. Reference sets were composed of from
two to five successive sessions (with a time interval of at least one week
between sessions). No commingling such as odd-numbered reference sessions
and even-numbered test sessions was allowed. For each case, the reference
and test sets were composed of equal numbers of successive independent
sessions, and two-direction recognition tests (as described above) were
made for the four cases.

The  results  are  presented  in  Table  3.  It  is  seen  that  for  all 
conditions,  higher  scores  were  obtained  as  the  number  of  cumulative 
sessions  increased. 


TABLE  3  GOES  HERE 


The difference in the speaker identification score between the first
two sessions and the first five sessions is around 15% for all cases
shown (L_v = 1000 was not used for two sessions since the covariance
matrices were singular). It is interesting to note that in a
text-dependent speaker verification experiment with different parameters
and approaches, Luck (22) found that speech samples collected over a five
week period gave the best results.


C.  Trends as a function of feature subsets


In a previous section, it was noted that the dispersion features had very
small variance ratios, whereas the mean features as a group consistently
had the largest variance ratios. How would recognition scores compare if
the dispersion features were omitted, or if only the mean features were
included? The recognition test for L_v = 1000 and five sessions per
reference and test set was repeated using several different feature
subsets, based on an analysis of the magnitudes of the variance ratios. In
one case, only the twelve mean features were used, and in a second case,
only the twenty-four mean and standard deviation features were used. The
average scores for the two cases were P(CI) = 93.6% with
P(FA) = P(FR) = 14.5%, and P(CI) = 96.8% with P(FA) = P(FR) = 7.2%
respectively. For comparison, the corresponding average scores for all
thirty-six features (Table 2) were P(CI) = 91.6% with P(FA) = P(FR) = 8.8%.

Not only did both of these new cases based on feature subsets yield
better scores than the original thirty-six dimension feature set, but in
the second case, the identification score was markedly increased by more
than 5%. This result is a significant practical illustration that the
inclusion of some parameters which would hopefully improve performance (or,
in the worst case, would have no effect on performance) can sometimes
actually degrade the system performance in an open test. In a closed test
with the distance metric used in this study, where a reference set also is
used as a test set, this theoretically cannot happen. Closed tests on this
data base verified that monotonic increases in the number of features
produced monotonic increases in P(CI) and monotonic decreases in the equal
error probability P(FA) = P(FR).

This improved performance by eliminating features with relatively
small variance ratios was the basis for one additional test with a feature
subset. In considering the remaining twenty-four features, the
gain-related features had very small variance ratios, and furthermore, the
inclusion of gain-related features was difficult to physically justify. In
fact, it could be argued that even if these features helped, they should
not be included because they may simply reflect a speaker's position,
interest, etc. during the recording session. Therefore, the recognition
test with only twenty-four features was repeated with the gain-related
features removed, and the performance of this last test with only
twenty-two parameters was better than any previous test. The final results
of this study using only the twenty-two fundamental frequency and
reflection coefficient long-term averages are shown in Table 4.


TABLE  4  GOES  HERE 


These  results  are  extremely  promising  for  future  studies  in  many  areas  of 
speaker  recognition.  This  substantially  large  testing  effort  (over  eighty 
million  distance  measurements)  has  shown  that  realistic  and  acceptable 
speaker  identification  and  speaker  verification  can  be  achieved  with 
text-independent  linguistically  unconstrained  speech. 


The cumulative probability functions (Figure 7A) and the probability
density functions (Figure 7B) for false rejection and false acceptance may
be used to compare the inter- and intra-speaker distances in the
verification task. These curves are derived from the first half of the
final speaker verification test with 22 features, L_v = 1000, reference
sessions 1-5 and test sessions 6-10. The equal error point is graphically
depicted as the crossover point of the two cumulative probability curves in
Figure 7A. This equal error point is found at a distance threshold where
the probability of false acceptance (i.e. acceptance of an imposter) is
equal to the probability of false rejection (i.e. rejection of a correct
speaker).
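
The crossover described here can be located empirically from the two sets of stored distances. The sketch below is one way to do it, assuming a test vector is accepted when its distance is at or below the threshold; the function name and the toy distance values are hypothetical and are not data from the study.

```python
import numpy as np

def equal_error_point(intra, inter):
    """Threshold where empirical P(FA) and P(FR) are closest, and their mean.

    intra: distances from test vectors to their correct reference speaker.
    inter: distances from test vectors to incorrect reference speakers.
    """
    intra = np.sort(np.asarray(intra, dtype=float))
    inter = np.sort(np.asarray(inter, dtype=float))
    thresholds = np.unique(np.concatenate([intra, inter]))
    # P(FR): fraction of correct-speaker distances above the threshold.
    p_fr = 1.0 - np.searchsorted(intra, thresholds, side="right") / intra.size
    # P(FA): fraction of imposter distances at or below the threshold.
    p_fa = np.searchsorted(inter, thresholds, side="right") / inter.size
    i = int(np.argmin(np.abs(p_fa - p_fr)))
    return float(thresholds[i]), float((p_fa[i] + p_fr[i]) / 2.0)

# Toy distances for illustration only.
threshold, eer = equal_error_point([1.0, 2.0, 3.0, 4.0], [3.5, 5.0, 6.0, 7.0])
print(f"threshold = {threshold}, equal error probability = {eer:.2f}")
```

With finite samples the two error curves rarely cross exactly, so the sketch reports the threshold at which they are closest and the average of the two error rates there.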

The probability density functions (pdfs) in Figure 7B show the
distribution of the intra- and inter-speaker distances. The crossover
point in Figure 7A divides each of the pdfs into two sections, with the
area under the intra-speaker pdf to the right of the dividing line equal to
the area under the inter-speaker pdf to the left of the dividing line. For
this data, the equal error crossover point is close to the intersection of
the two pdfs, but only identical and symmetric pdfs will always have
identical crossover and intersection points.

For test sessions 6-10 with L_v = 1000, there were a total of 1708 test
vectors from the 17 speakers. The distances between each of these test
vectors and the correct reference speaker comprise the intra-speaker
distance space. A histogram of these intra-speaker distances is shown in
Figure 8A. The mean and standard deviation of the histogram distances were
used to approximate normal and log-normal distributions. For the open
test, there is no underlying theoretical distribution, and a chi-square
test was used to measure the goodness of fit of the normal and log-normal
distributions. The log-normal distribution had the smallest chi-square
measure. Analogously, the distances between each of the 1708 test vectors
and each of the 16 incorrect speakers (i.e. eliminating the reference
speaker who is a correct match to the test vector) comprise the
inter-speaker distance space. A histogram of the 27,318 inter-speaker
distances is shown in Figure 8B. A log-normal distribution is a better fit
to the inter-speaker histogram than a normal distribution, but not as good
a fit as with the intra-speaker histogram.
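
A comparison of this kind can be sketched as below. The text does not say how the log-normal parameters were obtained from the mean and standard deviation, so moment matching is assumed here; the bin count, the minimum expected count of 5 per bin, the function names, and the synthetic distances are likewise assumptions made for illustration.

```python
import numpy as np
from math import erf, log, sqrt

def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + erf((x - mean) / (std * sqrt(2.0))))

def chi_square(observed, expected, min_expected=5.0):
    """Pearson chi-square measure over bins with enough expected counts."""
    keep = expected >= min_expected
    return float(np.sum((observed[keep] - expected[keep]) ** 2 / expected[keep]))

def fit_statistics(distances, bins=20):
    """Chi-square measures for normal and log-normal models of the histogram."""
    d = np.asarray(distances, dtype=float)   # distances must be positive
    m, s = d.mean(), d.std(ddof=1)
    observed, edges = np.histogram(d, bins=bins)
    # Log-normal parameters chosen to reproduce the same mean and std.
    sigma = sqrt(log(1.0 + (s / m) ** 2))
    mu = log(m) - 0.5 * sigma ** 2
    cdf_normal = np.array([normal_cdf(e, m, s) for e in edges])
    cdf_lognormal = np.array([normal_cdf(log(e), mu, sigma) for e in edges])
    expected_normal = d.size * np.diff(cdf_normal)
    expected_lognormal = d.size * np.diff(cdf_lognormal)
    return chi_square(observed, expected_normal), chi_square(observed, expected_lognormal)

# Synthetic, right-skewed distances standing in for a distance histogram.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=3.0, sigma=0.5, size=2000)
chi_normal, chi_lognormal = fit_statistics(sample)
print(f"chi-square: normal = {chi_normal:.1f}, log-normal = {chi_lognormal:.1f}")
```

The smaller of the two measures indicates the better-fitting model, which is the comparison the text reports for the histograms of Figure 8.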


V.  Summary 


The significance and value of long-term feature averaging for
text-independent speaker recognition with linguistically unconstrained
speech have been demonstrated. This study used practical analysis
conditions of
telephone-range  spectral  width  (0-3250  Hz)  and  parameters  obtained  from  a 
linear  prediction  vocoder.  All  parameter-related  computations  were 
performed  in  real  time  using  16-bit  integer  arithmetic,  and  all  parameters 
were  further  quantized  into  an  8-bit  format  for  efficient  disk  storage. 

The recording environment was controlled by recording the speakers
with a condenser microphone in an IAC sound room. An important extension
of this work would be to reprocess the "clean-text" audio tapes through
various channel disturbances such as the telephone system to determine the
robustness of the approach in less ideal environmental conditions (20).
Also, in some situations, reference data may be obtained in a clean
environment and subsequent speaker recognition attempted in a noisy
environment. This area also requires investigation.

Although  seventeen  speakers  is  not  a  trivial  population  size,  it 
appears  that  for  determining  the  importance  of  individual  features  for 
speaker  recognition  using  linguistically  unconstrained  text,  a 
substantially larger population base is required. It was found that
features  obtained  from  only  one  or  two  sessions  of  a  given  population  are 
relatively  unchanged  over  a  much  larger  number  of  time-spaced  sessions, 
where  there  was  at  least  one  week  between  sessions.  Other  features  should 
also  be  investigated.  It  has  been  suggested  that  mean  deviations  (21)  may 
prove  more  useful  than  the  standard  deviations  used  in  this  study.  Further 
research  is  also  required  to  assess  the  conditions,  e.g.,  the  number  of 
long-term  samples  from  a  speaker,  for  obtaining  a  good  estimate  of  the  mean 
and  variance  of  a  speaker's  characteristics. 

An  assumption  throughout  has  been  that  only  voiced  speech  frames  are 
to  be  used  in  the  analysis.  If  this  assumption  was  not  necessary,  or  if 
only  slight  degradation  occurred  if  both  voiced  and  unvoiced  speech  frames 
were  included,  the  process  would  be  simplified  computationally,  and  in 
addition,  1000  frames  per  average  would  correspond  to  a  real  time  interval 
only  about  half  as  long  as  required  here. 

The best speaker recognition was obtained when 1) five sessions
successively separated by at least one week were used to define the
reference set, 2) the mean and standard deviation of the long-term averages
of the fundamental frequency and reflection coefficients were used, and
3) each feature was obtained by averaging 1000 voiced analysis frames
(corresponding to average real-time intervals of about thirty-nine
seconds). With approximately eighteen hours of reference data and eighteen
hours of independent test data from seventeen speakers, spaced over nearly
three months in time, an average speaker identification score of 98.05% and
an average equal error speaker verification rate of 4.25% were measured.

FIGURES


Fig. 1  Standard deviation of long-term features as a function of L_v,
        the number of voiced frames per feature vector.

Fig. 2  Variance ratios from all 10 sessions as a function of long-term
        mean and standard deviations of parameters, L_v = 1000:
        A)  all male speakers
        B)  all female speakers

Fig. 3  Same conditions as Fig. 2 except:
        A)  male speakers: first five
        B)  male speakers: second five

Fig. 4  Same conditions as Fig. 2 except that L_v = 100:
        A)  all male speakers
        B)  all female speakers

Fig. 5  Same conditions as Fig. 2 except only sessions 1-2 shown:
        A)  all male speakers
        B)  all female speakers

Fig. 6  Relations between reference samples and test samples for
        experimental results of Table 3.

Fig. 7  Intra- and Inter-speaker Comparisons
        A)  Cumulative Probability
        B)  Probability Density Estimates

Fig. 8  Distance Histograms and Models
        A)  Intra-speaker Distances
        B)  Inter-speaker Distances


TABLES


Table 1  Number of feature vectors and average real-time interval (RTI)
         for each condition.

Table 2  Speaker recognition based on partitioning data in half and with
         36 long-term features.

Table 3  Percent of speakers correctly identified as a function of the
         number of reference sessions.

Table 4  Performance with fundamental frequency and reflection coefficient
         mean and standard deviation long-term features, L_v = 1000
         (average real-time interval = 39 seconds).


Table 1.  Number of feature vectors and average real-time interval (RTI)


(a)  SPEAKER IDENTIFICATION
     Percent of correct choices based on minimum distance

SESSION                        L_v
REF.    TEST           30      100      300     1000
1-5     6-10        61.20    78.65    88.20    93.34
6-10    1-5         59.87    75.48    85.27    89.77
AVERAGE             60.54    77.06    86.74    91.56

(b)  SPEAKER VERIFICATION
     Percent of false acceptances and false rejections based on equal
     error criterion

SESSION                        L_v
REF.    TEST           30      100      300     1000
1-5     6-10         43.4     27.8     10.7      9.4
6-10    1-5          42.8     26.9     10.5      8.2
AVERAGE              43.1     27.4     10.6      8.8

(c)  SPEAKER VERIFICATION
     Threshold distance based on equal error criterion

SESSION                        L_v
REF.    TEST           30      100      300     1000
1-5     6-10         5.79     7.52     9.78    18.84
6-10    1-5          5.85     7.58    10.85    21.10
AVERAGE              5.82     7.55    10.32    19.97

Table 2.  Speaker recognition based on partitioning data in half and with
36 long-term features


SESSIONS                             L_v
NO.   REF     TEST           30      100      300     1000
2     1-2     3-4         50.36    64.34    71.18       --
2     3-4     1-2         53.45    67.95    75.31       --
3     1-3     4-6         54.29    70.03    79.12    80.58
3     4-6     1-3         57.04    72.69    82.14    89.30
4     1-4     5-8         59.91    76.41    86.73    92.85
4     5-8     1-4         59.26    74.62    83.45    86.34
5     1-5     6-10        61.20    78.65    88.20    93.34
5     6-10    1-5         59.87    75.48    85.27    89.77

Table 3.  Percent of speakers correctly identified as a function of the
number of reference sessions


FINAL RESULTS OF 2-WAY TESTING ON 38 HOURS OF EXTEMPORANEOUS SPEECH


Table 4.  Performance with fundamental frequency and reflection coefficient
mean and standard deviation long-term features, L_v = 1000 (average
real-time interval = 39 seconds).


